Problem Overview and solution approach:

Objective:

Data dictionary:

Let's start by importing necessary libraries

Load and overview the dataset

Observations:

  1. The average customer age is around 46 years with the minimum being 26 years old and maximum being 73 years old.
  2. Customers have been on book for about 36 months on average, with a maximum of 56 months.
  3. There is a big range in Credit limit with lowest being 1438 and highest limit being 34516.
  4. Total amount change from Q4 to Q1 is mostly below 1, with a maximum of 3.39.
  5. Average utilization ratio ranges from 0 to 0.99.

EDA

Univariate Analysis

Observations on Customer Age

Observations:

  1. Two outliers are visible.
  2. The histogram suggests the variable is approximately normally distributed, as it has a bell-curve shape.
  3. The average age is 46 years.

Observations on Months on Book

Observations:

  1. There are outliers on both ends of the boxplot.
  2. Most of the values cluster around the mean of about 35.9 months.
  3. Most of the values are located between 25 and 45 months.

Observations on Credit Limit

Observations:

  1. All the visible outliers lie above the upper whisker, reaching up to the maximum of $34,516.
  2. The distribution is slightly right-skewed.
  3. There is a wide range in the data.

Observations on Total Revolving Balance

Observations:

  1. No outliers seem to be visible.
  2. Data are mostly clustered between 0 and 100 and between 1000 and 2500.

Observations on Average Open to Buy

Observations:

  1. Outliers are visible above the upper whisker, reaching up to the maximum of $34,516.
  2. The distribution is right-skewed.

Observations on Total Amount Changed Q4 to Q1

Observations:

  1. Outliers are visible on both the left and the right of the boxplot.
  2. The majority of the outliers are on the right of the boxplot.
  3. The distribution is slightly skewed to the right.
  4. Most of the data accumulates between 0 and 1.

Observations on Total Trans Amount

Observations:

  1. Outliers are primarily visible on the right of the boxplot.
  2. Two peaks are visible, between 6000 and 10000 and between 13000 and 15000.
  3. Most of the data accumulates between 0 and 5000.

Observations on Total Trans CT

Observations:

  1. One outlier is visible beyond the upper whisker.
  2. Two peaks are visible, at 40 and 80.
  3. Most of the values lie between 40 and 90.

Observations on Total_CT_Change Q4 to Q1

Observations:

  1. There are many outliers visible.
  2. The distribution is skewed to the right.
  3. This variable is a candidate for min-max scaling.
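
The min-max scaling mentioned above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `MinMaxScaler` is the intended tool; the column name `Total_Ct_Chng_Q4_Q1` and the values are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values standing in for a right-skewed ratio column (assumed name).
df = pd.DataFrame({"Total_Ct_Chng_Q4_Q1": [0.0, 0.5, 0.7, 1.0, 3.7]})

# MinMaxScaler maps the column minimum to 0 and the column maximum to 1.
scaler = MinMaxScaler()
df["ct_chng_scaled"] = scaler.fit_transform(df[["Total_Ct_Chng_Q4_Q1"]]).ravel()
print(df)
```

Min-max scaling only rescales the range; it does not change the shape of the distribution, so the right skew remains after scaling.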

Observations on Avg Utilization Ratio

Observations:

  1. No outliers visible.
  2. The mean of the average utilization ratio is 0.27.

Outlier Treatment

Observations:

  1. The outliers that were visible in the analysis of the continuous variables are no longer visible.
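
The notebook text does not say which treatment was applied. One common option is to cap values at the boxplot whiskers (1.5 times the interquartile range beyond the quartiles). The sketch below assumes that approach; `cap_outliers_iqr` is a hypothetical helper name and the ages are made up.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up ages; 120 is far beyond the upper whisker.
ages = pd.Series([26, 40, 44, 46, 48, 52, 73, 120])
capped = cap_outliers_iqr(ages)
print(capped.tolist())
```

Capping keeps every row in the data set, which matters when the minority class is already small.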

Let's define a function to create barplots for the categorical variables, showing the percentage of each category for each variable.
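
A minimal sketch of such a function, assuming matplotlib is available. The name `labeled_barplot` and the toy data are illustrative, not the notebook's original code.

```python
import matplotlib.pyplot as plt
import pandas as pd

def labeled_barplot(data: pd.DataFrame, feature: str):
    """Bar plot of category counts, annotated with each category's percentage."""
    counts = data[feature].value_counts()  # sorted from most to least frequent
    pct = 100 * counts / counts.sum()
    fig, ax = plt.subplots(figsize=(len(counts) + 2, 4))
    bars = ax.bar(counts.index.astype(str).tolist(), counts.values)
    for bar, p in zip(bars, pct):
        # Place the percentage label centred just above each bar.
        ax.annotate(f"{p:.1f}%",
                    (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                    ha="center", va="bottom")
    ax.set_xlabel(feature)
    ax.set_ylabel("count")
    return ax

# Toy data for illustration only.
toy = pd.DataFrame({"Gender": ["F", "F", "M", "F", "M"]})
ax = labeled_barplot(toy, "Gender")
```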

Observations:

  1. Existing customer is the most frequent.
  2. Female customers are more frequent.
  3. Most customers are graduate customers.
  4. Married customers are more frequent.
  5. Most customers use the blue card.
  6. The most frequent value of total relationship count, months inactive and contacts count is 3 in each case.

Observations on Attrition Flag

Observations:

  1. Existing customers are the most common group, at 83.9%.
  2. Attrited customers make up 16.1%.

Observations on Gender

Observations:

  1. Females are the most common customers, at 52.9%.
  2. Males make up 47.1%.

Observations on Dependent Count

Observations:

  1. 2 and 3 dependents are the most common counts, at 26.2% and 27.0%.
  2. The least common count is 5 dependents, at 4.2%.

Observations on Education Level

Observations:

  1. Graduates are the most common customers, at 30.9%.
  2. The least common are doctorates at 4.5% and post-graduates at 5.19%.

Observations on Marital Status

Observations:

  1. Divorced customers are the least common, at 7.4%.
  2. Married customers are the most common at 46.3%, followed by single customers at 38.9%.

Observations on Income Category

Observations:

  1. The most common customers are those who make less than $40K.
  2. The least common customers make more than $120K.

Observations on Card Category

Observations:

  1. The most commonly used card is blue, at 93.2%.
  2. The least commonly used card types are platinum at 0.2% and gold at 1.1%.

Observations on Total Relationship Count

Observations:

  1. The most common total relationship count is 3.
  2. The least common total relationship count is 1, at 9%.

Observations on Months inactive - 12 month period

Observations:

  1. 3 is the most common number of months inactive, followed by 2 and 1.
  2. The least common value is 0 months inactive, at 0.3%.

Observations on Contacts count 12 month period

Observations:

  1. The most frequent contacts count is 3 (33.4%), followed by 2 (31.9%), 1 (14.8%) and 4 (13.7%).
  2. The least frequent contacts counts are 6 (0.5%) and 0 (3.9%).

Bivariate Analysis

Observations:

  1. Customer age and months on book have a strong positive correlation of 0.77.
  2. Credit limit and average open to buy have a near-perfect correlation of 0.99, which makes sense as these two variables are complementary to each other.
  3. Average open to buy and average utilization ratio have a strong negative correlation of -0.6.
  4. Total transaction count and total transaction amount have a strong correlation of 0.86, which makes sense: as the number of transactions goes up, the total amount transacted over the period is likely to increase.
  5. Average utilization ratio and total revolving balance have a strong correlation of 0.7.

Let's define one more function to plot stacked bar charts
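
A minimal sketch of such a function, assuming a row-normalised `pd.crosstab` is what feeds the chart. The function names, column names and toy data are illustrative, not the notebook's original code.

```python
import pandas as pd

def stacked_share_table(data: pd.DataFrame, predictor: str, target: str) -> pd.DataFrame:
    """Share of each target class within each level of the predictor (rows sum to 1)."""
    return pd.crosstab(data[predictor], data[target], normalize="index")

def stacked_barplot(data: pd.DataFrame, predictor: str, target: str):
    """Draw one 100% stacked bar per predictor level; requires matplotlib."""
    shares = stacked_share_table(data, predictor, target)
    return shares.plot(kind="bar", stacked=True, figsize=(len(shares) + 3, 4))

# Toy data for illustration only.
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "M", "M", "F"],
    "Attrition_Flag": ["Existing", "Attrited", "Existing", "Existing",
                       "Existing", "Attrited", "Existing", "Existing"],
})
shares = stacked_share_table(toy, "Gender", "Attrition_Flag")
print(shares)
ax = stacked_barplot(toy, "Gender", "Attrition_Flag")
```

Normalising by row makes the bars comparable across levels with very different counts, which is what the ratio statements in the observations rely on.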

Attrition Flag vs Gender

Observations:

  1. The ratio of attrited to existing customers is about the same for male and female customers.

Attrition Flag vs Dependent Count

Observations:

  1. The ratio of attrited to existing customers is mostly the same across each value of dependent count.

Attrition Flag vs Education Level

Observations:

  1. The ratio of attrited to existing customers is about the same across all education levels.
  2. There are more existing customers than attrited customers across all education levels.

Attrition Flag vs Marital Status

Observations:

  1. There are more existing customers than attrited customers across all marital statuses.
  2. The ratio of existing to attrited customers is uniform across all marital statuses.

Attrition Flag vs Income Category

Observations:

  1. The ratio of attrited to existing customers is maintained uniformly across all income categories.
  2. Existing customers are more prevalent than attrited customers.

Attrition Flag vs Card Category

Observations:

  1. The ratio of existing to attrited customers is around 0.8 to 0.2.
  2. There are more existing customers than attrited customers.

Attrition Flag vs Total Relationship Count

Observations:

  1. The ratio of existing to attrited customers is around 0.75 to 0.25.
  2. There are more existing customers than attrited customers.

Attrition Flag vs Months Inactive 12 month period

Observations:

  1. At 0 months inactive in the 12-month period, attrited and existing customers are equally distributed, at 0.5 each.
  2. There are significantly more existing customers than attrited customers at 1 month inactive.

Attrition Flag vs Contacts count 12 month period

Observations:

  1. Customers with 0 contacts are almost all existing customers.
  2. Customers with 6 contacts are all attrited customers.
  3. The ratio of attrited to existing customers increases as the contacts count goes from 1 to 5.

Attrition Flag vs Continuous Variables

Observations:

  1. Average customer age is about the same for attrited and existing customers.
  2. Average months on book appears to be equal for attrited and existing customers.
  3. For total transaction amount, total transaction count and the Q4-to-Q1 change in transaction count, existing customers are higher than attrited customers on average.

Observations on customer profile for Existing customers:

  1. On average the customer age is around 46 years old.
  2. Females are prevalent.
  3. The number of dependents is about 3.
  4. Graduate, married customers and those who make below $40K are the most visible groups.
  5. The blue card is mostly used, with total relationship count, months inactive and contacts count all around 3.
  6. Credit limit on average is about $8000.
  7. The total amount change and total count change are about 0.73.
  8. The average amount left on credit is about 6735.

Correlation Heatmap

Observations:

  1. Customer age and months on book are positively correlated at 0.79. This makes sense, as the longer one has had an account, the older the person will be.
  2. Credit limit has a near-perfect correlation with average open to buy, at 0.99.
  3. Total revolving balance and average utilization ratio have a correlation of about 0.62.
  4. Total transaction count and total transaction amount have a correlation of 0.86: as the number of transactions increases, the total amount transacted should also be large.
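
Pairs like the ones listed above can be pulled out of a correlation matrix programmatically rather than read off the heatmap. The sketch below uses synthetic data; the column names, the construction of the data and the 0.6 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
age = rng.normal(46, 8, n)
toy = pd.DataFrame({
    "Customer_Age": age,
    "Months_on_book": age - 10 + rng.normal(0, 5, n),  # built to track age
    "Noise": rng.normal(0, 1, n),                      # unrelated column
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the upper triangle so each pair appears once, then rank by |r|.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(key=np.abs, ascending=False)
strong = pairs[pairs.abs() >= 0.6]
print(strong)
```

If seaborn is installed, `sns.heatmap(corr, annot=True)` draws the same matrix as an annotated heatmap.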

Splitting the dataset into Train, Validation and Test set
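
A three-way split can be sketched with two calls to `train_test_split`. The 60/20/20 proportions, the stratification and the synthetic data are assumptions; the text does not state the proportions that were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, roughly 16% in the minority class.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.16).astype(int)

# Step 1: hold out 20% as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)
# Step 2: take 25% of the remaining 80% as validation (20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=1
)
print(X_train.shape, X_val.shape, X_test.shape)
```

Stratifying on the target keeps the share of the minority class nearly identical in all three sets, which matters with a class balance of roughly 84% to 16%.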

Missing-Value Treatment

Imputing Missing Values

Model evaluation criterion

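We will use recall as the metric for model performance. The company can face two types of losses here: predicting that a customer will leave when they would have stayed (retention effort is wasted), and predicting that a customer will stay when they actually leave (the customer is lost). The second is typically the costlier mistake, so the aim is to minimize false negatives, which is what recall measures.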
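
Recall is the share of actual positives that the model finds: TP / (TP + FN). The small example below uses made-up labels and assumes the class of interest is encoded as 1.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 1 is the class we want to catch (assumed encoding).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_by_hand = tp / (tp + fn)

print(recall_by_hand, recall_score(y_true, y_pred))
```

A model that predicts the positive class for every row reaches a recall of 1, so recall is best read alongside accuracy and precision.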

For model building, we will first build 6 models: logistic regression, bagging, GBM, AdaBoost, XGBoost and decision tree. We will evaluate each with k-fold cross-validation and then on the validation set, measuring the recall score in both cases. From the 6 models, we will choose the 3 with the best recall on the validation set.
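
The comparison loop described above can be sketched as follows. This is an illustration on synthetic data, not the notebook's code: XGBoost is left out so the example depends on scikit-learn only, and the fold count, default hyperparameters and random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real training data.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.84, 0.16], random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "Bagging": BaggingClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "GBM": GradientBoostingClassifier(random_state=1),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
results = {}
for name, model in models.items():
    # Mean recall over the folds, computed on the training data only.
    cv_recall = cross_val_score(model, X_train, y_train, scoring="recall", cv=cv).mean()
    # Recall on the held-out validation set after fitting on all training data.
    val_recall = recall_score(y_val, model.fit(X_train, y_train).predict(X_val))
    results[name] = (cv_recall, val_recall)

# Keep the three models with the highest validation recall.
top3 = sorted(results, key=lambda name: results[name][1], reverse=True)[:3]
print(top3)
```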

Observations:

  1. From the bar plots, all the models have good cross-validation scores and performed well on the validation set.
  2. The recall score on the validation set is about 95% or above for all of them.
  3. We will choose the 3 best models from here: gradient boosting with 99.9%, XGBoost with 98.8% and AdaBoost with 98.2%.

Hyperparameter Tuning

We will tune the AdaBoost, gradient boosting and XGBoost models using RandomizedSearchCV.
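
A minimal sketch of tuning one of these models with `RandomizedSearchCV`, scored on recall. The search space, `n_iter`, fold count and synthetic data are illustrative assumptions, not the values used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the training data.
X, y = make_classification(n_samples=400, n_features=8, weights=[0.84, 0.16], random_state=1)

# Hypothetical search space; each list is sampled uniformly.
param_distributions = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "max_depth": [2, 3, 4],
    "subsample": [0.7, 0.9, 1.0],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,          # number of random combinations to try
    scoring="recall",  # optimise the metric chosen for this problem
    cv=3,
    random_state=1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Unlike an exhaustive grid search, the random search tries only `n_iter` combinations, which keeps the cost bounded when the search space grows.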

First, let's create two functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
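
The two helpers could look like the sketch below. The function names, the returned table layout and the toy model are assumptions made for illustration.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.tree import DecisionTreeClassifier

def model_performance_classification(model, predictors, target) -> pd.DataFrame:
    """One-row table with accuracy, recall, precision and F1 for a fitted model."""
    pred = model.predict(predictors)
    return pd.DataFrame({
        "Accuracy": [accuracy_score(target, pred)],
        "Recall": [recall_score(target, pred)],
        "Precision": [precision_score(target, pred)],
        "F1": [f1_score(target, pred)],
    })

def confusion_matrix_table(model, predictors, target) -> pd.DataFrame:
    """Confusion matrix as a labelled table (rows: actual, columns: predicted)."""
    cm = confusion_matrix(target, model.predict(predictors), labels=[0, 1])
    return pd.DataFrame(cm, index=["Actual 0", "Actual 1"],
                        columns=["Predicted 0", "Predicted 1"])

# Toy, perfectly separable data so the expected values are easy to check.
X_toy = [[0], [1], [2], [3]]
y_toy = [0, 0, 1, 1]
tree = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)

perf = model_performance_classification(tree, X_toy, y_toy)
cm_table = confusion_matrix_table(tree, X_toy, y_toy)
print(perf)
print(cm_table)
```

Calling the same two functions on the train and validation sets for each model makes the overfitting checks in the following sections directly comparable.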

AdaBoost

Observations:

  1. There appears to be no overfitting between the train and validation sets.
  2. We get a perfect recall score of 1 and good accuracy scores, with both at 0.83.
  3. Need to test this with oversampled and undersampled data.

Gradient Boosting

Observations:

  1. No overfitting is observed.
  2. Good scores across both the training and validation data sets.
  3. All score metrics are above 90% suggesting a good generalized model.

XGBoost

Observations:

  1. No overfitting is observed between the train and validation sets.
  2. Perfect recall score on both the train and validation sets.
  3. All other score metrics also perform well, as all scores are at 0.84 and above.

Changing the sampling size of data by oversampling and undersampling

Oversampling train data using SMOTE
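
SMOTE creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified re-implementation of that idea for a binary target, written for illustration only; it is not the imbalanced-learn implementation (that library provides `imblearn.over_sampling.SMOTE`). The function name and toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X, y, minority_label=1, k=5, random_state=1):
    """Add synthetic minority rows until both classes have the same size."""
    rng = np.random.default_rng(random_state)
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)             # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position along the segment
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_res = np.vstack([X, synthetic])
    y_res = np.concatenate([y, np.full(n_new, minority_label)])
    return X_res, y_res

# Toy training data: 84 majority rows and 16 minority rows.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = np.array([0] * 84 + [1] * 16)

X_over, y_over = smote_sketch(X_train, y_train)
print(np.bincount(y_over))
```

Only the training data should be resampled; the validation and test sets keep their original class balance so that the scores reflect real conditions.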

AdaBoost on Oversampled Data

Observations:

  1. There is a slight difference in accuracy between the train and validation sets, but no strong indication of overfitting.
  2. The recall score is slightly lower than for the previous AdaBoost model.
  3. Precision is good, but since that isn't our evaluation metric we will look at other models.

Gradient Boosting on Oversampled data

Observations:

  1. Good scores on all metrics.
  2. No overfitting visible.
  3. As with the tuned gradient boosting model, gradient boosting on oversampled data has performed well on all metrics.

XGBoost on oversampled data

Observations:

  1. Overfitting is visible with XGBoost on oversampled data: there is a disparity between the train and validation accuracy scores.
  2. The recall score is a perfect 1 on both the train and validation sets, suggesting a really good model.
  3. Other metrics are also performing well.

Undersampling train data using Random Under Sampler
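
Random undersampling drops majority rows at random until the classes are balanced. The sketch below shows the idea with NumPy only; it is an illustration, not imbalanced-learn's `RandomUnderSampler`, and the function name and toy data are hypothetical.

```python
import numpy as np

def random_undersample(X, y, random_state=1):
    """Keep a random subset of every class, sized to match the smallest class."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy training data: 84 majority rows and 16 minority rows.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = np.array([0] * 84 + [1] * 16)

X_under, y_under = random_undersample(X_train, y_train)
print(np.bincount(y_under))
```

Undersampling discards most of the majority rows, so the models in this section are trained on far fewer rows than the oversampled ones.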

AdaBoost on undersampled data

Observations:

  1. Overall, good accuracy scores on both the train and validation sets. No overfitting visible.
  2. AdaBoost hasn't performed well on the oversampled data, the undersampled data or as the tuned model.
  3. It does not reach the best recall score on either the train or the validation set.

Gradient boosting on Undersampled data

Observations:

  1. The accuracy scores indicate a good model.
  2. No overfitting visible.
  3. Good recall scores, at 87%.

XGBoost on undersampled data

Observations:

  1. The train set performance is slightly poor for both accuracy and precision.
  2. The recall scores for both the train and validation sets are really good, at nearly 99%.
  3. Overall, this indicates a good model.

Comparing all models

Observations:

  1. The model's performance on the test set confirms that it generalizes well.
  2. The accuracy, precision and recall scores are all above 0.9.
  3. The recall score of 0.99 indicates that the model fits the data set really well.
  4. From the feature importances, total transaction amount in the past 12 months is the most important variable.
  5. Total count change from Q4 to Q1 and total transaction count are also very important variables.
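
Feature importances like the ones described above can be read from a fitted tree ensemble through its `feature_importances_` attribute. The sketch below uses synthetic data in which only the first two columns carry signal; the model choice and column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# With shuffle=False the informative features come first, followed by pure noise.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=1,
)
columns = ["informative_0", "informative_1", "noise_0", "noise_1", "noise_2"]

model = GradientBoostingClassifier(random_state=1).fit(X, y)

# Importances are normalised to sum to 1; sort to see the strongest drivers first.
importances = pd.Series(model.feature_importances_, index=columns).sort_values(ascending=False)
print(importances)
```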

Pipelines for productionizing the model

Column Transformer
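
A production pipeline typically bundles preprocessing and the model so that new data goes through exactly the same steps as the training data. The sketch below is one possible layout, under assumptions: the column names, imputation strategies, encoder and model settings are illustrative, and the toy rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_cols = ["Customer_Age", "Credit_Limit"]   # assumed column names
categorical_cols = ["Gender", "Card_Category"]    # assumed column names

preprocessor = ColumnTransformer([
    # Numeric columns: fill missing values with the training median.
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Categorical columns: fill with the most frequent level, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("gbm", GradientBoostingClassifier(random_state=1)),
])

# Made-up rows, with missing numeric values to exercise the imputer.
toy = pd.DataFrame({
    "Customer_Age": [45, 52, np.nan, 38, 61, 47, 33, 58],
    "Credit_Limit": [3000.0, 12000.0, 8000.0, np.nan, 1500.0, 34516.0, 2500.0, 9000.0],
    "Gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "Card_Category": ["Blue", "Blue", "Silver", "Blue", "Gold", "Blue", "Blue", "Silver"],
})
target = [0, 1, 0, 0, 1, 0, 0, 1]

model.fit(toy, target)

# A new row with a card type never seen in training is still handled.
new_row = pd.DataFrame({
    "Customer_Age": [50], "Credit_Limit": [np.nan],
    "Gender": ["F"], "Card_Category": ["Platinum"],
})
print(model.predict(new_row))
```

Because the imputers and the encoder are fitted inside the pipeline, they learn their statistics from the training data only, which avoids leaking information from validation or test rows.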

Recommendation